Abstract

We present a system that enables rapid model 
experimentation for tera-scale machine learn- 
ing with trillions of non-zero features, billions 
of training examples, and millions of param- 
eters. Our contribution to the literature is a 
new method (SA L-BFGS) for changing batch 
L-BFGS to perform in near real-time by using 
statistical tools to balance the contributions of 
previous weights, old training examples, and 
new training examples to achieve fast conver- 
gence with few iterations. The result is, to 
our knowledge, the most scalable and flexible 
linear learning system reported in the literature, 
beating standard practice with the current best 
system (Vowpal Wabbit and AllReduce). Using 
the KDD Cup 2012 data set from Tencent, Inc. 
we provide experimental results to verify the 
performance of this method. 

1 Introduction 

The demand for analysis and predictive modeling derived 
from very large data sets has grown immensely in recent 
years. One of the big problems in meeting this demand is 
the fact that data has grown faster than the availability of 
raw computational speed. As such, it has been necessary 
to use intelligent and efficient approaches when tackling 
the data training process. Specifically, there has been 
much focus on problems of the form 

$$\min_{\theta} \sum_{i=1}^{m} l(\theta^T x^{(i)}; y^{(i)}) + \lambda S(\theta), \qquad (1)$$

where $x^{(i)}$ is the feature vector of the $i$th example,
$y^{(i)} \in \{0, 1\}$ is the label, $\theta \in \mathbb{R}^l$ is the
vector of fitting parameters, $l$ is a smooth convex loss function
and $S$ a regularizer. Popular instances of this framework include
linear and logistic regression. The optimal $\theta$ corresponds to
a linear predictor function $p_\theta(x) = \theta^T x$ that best fits
the data in some appropriate sense, depending on $l$ and $S$. We
remark that cost functions of the form (1) have a structure
which is naturally decomposable over the given training
examples, so that all computations can potentially be run
in parallel over a distributed environment.
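To make the decomposability concrete, the following minimal Python sketch evaluates an objective of the form (1) both directly and as a sum of independent per-shard partial sums. The logistic loss and the choice $S(\theta) = \frac{1}{2}\|\theta\|_2^2$ are assumptions made only for illustration, and the function names are hypothetical.

```python
import math

def logistic_loss(z, y):
    """Smooth convex loss l(z; y) for a score z = theta^T x and a label y in {0, 1}."""
    # log(1 + exp(z)) - y * z, written so that exp never overflows
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z

def objective(theta, examples, lam):
    """Sum of per-example losses plus lam * S(theta), with S(theta) = 0.5 * ||theta||^2 (assumed)."""
    data_term = sum(
        logistic_loss(sum(t * xi for t, xi in zip(theta, x)), y)
        for x, y in examples
    )
    return data_term + lam * 0.5 * sum(t * t for t in theta)

def objective_sharded(theta, shards, lam):
    """Same value, computed from per-shard partial sums that could run on separate nodes."""
    partial_sums = [objective(theta, shard, 0.0) for shard in shards]
    return sum(partial_sums) + lam * 0.5 * sum(t * t for t in theta)
```

Only the data term is split across shards; the regularizer depends on $\theta$ alone and is added once.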

However, in practice it is often the case that the model
must be updated as new data is acquired. That is, we want
to answer the question of how $\theta$ changes in the presence
of new training examples. One naive approach would be to
completely redo the entire data analysis process from scratch
on the larger data set. The current fastest method in such a
case utilizes the L-BFGS quasi-Newton minimization algorithm
with AllReduce along a distributed cluster (Agarwal et al., 2012).
The other extreme is to apply the method of online learning,
which considers one data point at a time and updates the
parameters $\theta$ according to some form of gradient descent;
see (Langford et al., 2009) and (Duchi et al., 2010). However,
on the tera-scale, neither approach is as appealing or as fast
as we can achieve with our method. We describe these recent
approaches to solving (1) in a bit more detail in Section 3.
For completeness, we also refer the reader to recent work
relating to large-scale optimization contained in
(Schraudolph et al., 2007) and (Bottou, 2010).

Our approach in simple terms lies somewhere between pure
L-BFGS minimization (widely accepted as the fastest brute
force optimization algorithm whenever the function is convex
and smooth) and online learning. While L-BFGS offers accuracy
and robustness with a relatively small number of iterations,
it fails to take direct advantage of situations where the new
data is not very different from that acquired previously or
situations where the new data is extremely different from the
old data. Certainly, one can initiate a new optimization job
on the larger data set with the parameter $\theta$ initialized
to the previous result. But we are left with the problem of
optimizing over increasingly larger training sets at one
time. Similarly, online learning methods only consider
one data point at a time and cannot reasonably change
the parameter $\theta$ by too much at any given step without
risk of severely increasing the regret. They also cannot
typically reach as small an error count as a global
gradient descent approach. On the other hand, we will
show that it is possible to combine the advantages of
both methods: in particular the small number of iterations
and speed of L-BFGS when applied to reasonably sized
batches, and the ability of online learning to "forget"
previous data when the new data has changed significantly.

The outline of the paper is as follows. In Section 2
we describe the general problem of interest. In Section 3
we briefly mention current widely used methods of solving (1).
In Section 4 we outline the statistically adaptive learning
algorithm. Finally, in Section 5 we benchmark the performance
of our two related methods (Context Relevant FAST L-BFGS and
SA L-BFGS, respectively) against Vowpal Wabbit, one of the
fastest currently available routines, which incorporates the
work of (Agarwal et al., 2012). We also include the associated
Area Under Curve (AUC) rating, which, roughly speaking, is a
number in [0, 1], where a value of 1 indicates perfect
prediction by the model.

2 Background and Problem Setup 

In this paper the underlying problem is as follows. Suppose
we have a sequence of time-indexed data sets $\{X_t, Y_t\}$,
where $X_t = \{x_t^{(i)}\}_{i=1}^{m_t}$, $Y_t = \{y_t^{(i)}\}_{i=1}^{m_t}$,
$t = 0, 1, \ldots$ is the time index, and $m_t \in \mathbb{N}$ is
the batch size (typically independent of $t$). Such data is given
sequentially as it is acquired (e.g. $t$ could represent days),
so that at $t = 0$ one only has possession of $\{X_0, Y_0\}$.
Alternatively, if we are given a large data set all at once, we
could divide it into batches indexed sequentially by $t$. In
general, we use the notation $x_t$ with subscript $t$ to denote
a time-dependent vector at time $t$, and we write $x_{t,j}$ to
denote the $j$th component of $x_t$. For each $t = 0, \ldots, t_f$
we define

$$f_t(\theta) := \sum_{i=1}^{m_t} l(\theta^T x_t^{(i)}; y_t^{(i)}), \qquad \phi_t(\theta) := f_t(\theta) + \lambda S(\theta), \qquad (2)$$



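where as before, $l$ is a given smooth convex loss function
and $S$ is a regularizer. Also let $\theta_t$ be the parameter vector
obtained at time $t$, which in practice will approximately
minimize $\lambda S(\theta) + \sum_{s=0}^{t} f_s(\theta)$. We define
the regret with respect to a fixed (optimal) parameter $\theta^*$
(in practice we don't know the true value of $\theta^*$) by

$$R_\phi(t) := \sum_{s=0}^{t} \left[ \phi_s(\theta_s) - \phi_s(\theta^*) \right] \qquad (3)$$
$$\phantom{R_\phi(t) :}= \sum_{s=0}^{t} \left[ f_s(\theta_s) + \lambda S(\theta_s) - f_s(\theta^*) - \lambda S(\theta^*) \right].$$

An effective algorithm is then one in which the sequence
$\{\theta_t\}_{t=0}^{\infty}$ suffers sub-linear regret, i.e., $R_\phi(t) = o(t)$.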

As mentioned earlier, there has been much work done
regarding how to solve (1) with a variety of methods.
Before proceeding with a basic overview of the two most
popular approaches to large-scale machine learning in
Section 3, it is important to understand the underlying
assumptions and implications of the previous body of work.
In particular, we mention the work of Leon Bottou in
(Bottou, 2010) regarding large-scale optimization with
stochastic gradient descent, and (Schraudolph et al., 2007)
regarding stochastic online L-BFGS (oL-BFGS) optimization.
Such works and others have demonstrated that with a lot of
randomly shuffled data, a variety of methods (oL-BFGS, 2nd
order stochastic gradient descent and averaged stochastic
gradient descent) can work in fewer iterations than L-BFGS
because:

(a) Small data learning problems are fundamentally dif- 
ferent from large data learning problems; 

(b) The cost functions as framed in the literature have 
well suited curvature near the global minimum. 

We remark that the key problem for all quasi-Newton 
based optimization methods (including L-BFGS) has 
been that noise associated with the approximation pro- 
cess - with specific properties dependent on each learning 
problem - causes adverse conditions which can make L- 
BFGS (and its variants) fail. However, the problem of 
noise leading to non-positive curvature near the minimum 
can be averted if the data is appropriately shaped (i.e. fea- 
ture selection plus proper data transformations). For now 
though, we ignore the issue and assume we already have 
a methodology for "feature shaping" that assures under 
operational conditions that the curvature of the resulting 
learning problem is well-suited to the algorithm that we 
describe. 

3 Previous Work 
3.1 Online Updates 

In online learning, the problem of storage is completely
averted as each data point is discarded once it is read.
We remark that one can essentially view this approach
as a special case of the statistically adaptive method
described in Section 4 with a batch size of 1. Such
algorithms iteratively make a prediction $\theta_t \in \mathbb{R}^l$
and then receive a convex loss function $\phi_t$ as in (2).
Typically, $\phi_t(\theta) = l(\theta^T x_t; y_t) + \lambda S(\theta)$,
where $(x_t, y_t)$ is the data point read at time $t$. We then
make an update to obtain $\theta_{t+1}$ using a rule that is
typically based on the gradient of $l(\theta^T x_t; y_t)$ in
$\theta$. Indeed, the simplest approach (with no regularization
term) would be the update rule

$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta l(\theta_t^T x_t; y_t).$$
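As an illustration of this simplest update rule, here is a minimal Python sketch of one online pass. The logistic loss and the decaying step size $\eta_t = \eta_0 / \sqrt{t+1}$ are assumptions chosen for the example; the text above does not fix either.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def online_pass(stream, dim, eta0=0.5):
    """Read each (x, y) once, take one gradient step, then discard the point."""
    theta = [0.0] * dim
    for t, (x, y) in enumerate(stream):
        z = sum(th * xi for th, xi in zip(theta, x))
        residual = sigmoid(z) - y          # derivative of the logistic loss in z
        eta = eta0 / math.sqrt(t + 1)      # assumed step-size schedule
        theta = [th - eta * residual * xi for th, xi in zip(theta, x)]
    return theta
```

No data point is stored after its update, which is the storage advantage noted above.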

However, there currently exist more sophisticated update
schemes which can achieve better regret bounds for (3).
In particular, the work of Duchi, Hazan, and Singer
is a type of subgradient method with adaptive proximal
functions. It is proven that their ADAGRAD algorithm
can achieve theoretical regret bounds of the form

$$R_\phi(t) = O\left( \|\theta^*\|_2 \, \mathrm{tr}(G_t^{1/2}) \right) \quad \text{and} \quad R_\phi(t) = O\left( \max_{s \le t} \|\theta_s - \theta^*\|_2 \, \mathrm{tr}(G_t^{1/2}) \right), \qquad (4)$$

where in general, $G_t = \sum_{s=0}^{t} g_s g_s^T$ is an outer
product matrix generated by the sequence of gradients
$g_s = \nabla_\theta f_s(\theta_s)$ (Duchi et al., 2010). We remark
that since the loss function gradients converge to zero under
ideal conditions, the estimate (4) is indeed sublinear, because
the decay of the gradients, however slow, counters the linear
growth in the size of $G_t^{1/2}$.
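For intuition, the widely used diagonal variant of ADAGRAD can be sketched in a few lines of Python. This is a simplified illustration, not the general proximal method analyzed by Duchi et al.: it keeps only the running sum of squared gradients (the diagonal of the outer product matrix), and the step size `eta` and the small constant `eps` are assumptions.

```python
import math

def adagrad_step(theta, grad, sum_sq, eta=0.1, eps=1e-8):
    """One diagonal ADAGRAD update; returns the new parameters and accumulator."""
    # Accumulate squared gradients per coordinate (diagonal of sum_s g_s g_s^T).
    new_sum_sq = [s + g * g for s, g in zip(sum_sq, grad)]
    # Coordinates with a large gradient history receive smaller steps.
    new_theta = [
        th - eta * g / (math.sqrt(s) + eps)
        for th, g, s in zip(theta, grad, new_sum_sq)
    ]
    return new_theta, new_sum_sq
```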



3.2 Vowpal Wabbit with Gradient Descent 

Vowpal Wabbit is a freely available software package
which implements the method described briefly in
(Agarwal et al., 2012). In particular, it combines online
learning and brute force gradient-descent optimization in
a slightly different way. First, one does a single pass
over the whole data set using online learning to obtain
a rough choice of parameter $\theta$. Then, L-BFGS optimization
of the cost function is initiated with the data split
across a cluster. The cost function and its gradient are
computed locally and AllReduce is utilized to collect the
global function values and gradients in order to update $\theta$.
The main improvement of this algorithm over previous
methods is the use of AllReduce with the Hadoop file
structure, which significantly cuts down on the communication
time incurred with MapReduce. Moreover, the baseline online
learning step is done with a learning rate chosen in an
optimal manner as discussed in
(Karampatziakis and Langford, 2011).
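The local-compute-then-reduce pattern described above can be sketched as follows. This is a single-process simulation under stated assumptions, not Vowpal Wabbit's implementation: `all_reduce_sum` is a hypothetical stand-in for the AllReduce primitive, the loss is assumed logistic, and the regularizer is assumed to be $\frac{\lambda}{2}\|\theta\|_2^2$.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def local_gradient(theta, shard):
    """Gradient of the summed logistic loss over the shard held by one node."""
    grad = [0.0] * len(theta)
    for x, y in shard:
        z = sum(t * xi for t, xi in zip(theta, x))
        residual = sigmoid(z) - y
        for j, xj in enumerate(x):
            grad[j] += residual * xj
    return grad

def all_reduce_sum(vectors):
    """Stand-in for AllReduce: every node would end up with this elementwise sum."""
    return [sum(column) for column in zip(*vectors)]

def global_gradient(theta, shards, lam):
    """Combine per-node gradients, then add the gradient of (lam / 2) * ||theta||^2."""
    local = [local_gradient(theta, shard) for shard in shards]  # one per node
    total = all_reduce_sum(local)
    return [g + lam * t for g, t in zip(total, theta)]
```

Because the data term is a sum over examples, the reduced gradient equals the gradient computed on the unpartitioned data, so the optimizer itself is unaffected by how the data is split.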



4 Our Approach 

4.1 Least Squares Digression 

Before we describe the statistically adaptive approach for
minimizing a generic cost function, consider the following
simpler scenario in the context of least squares regression.
Given data $\{X, Y\}$ with $X \in \mathbb{R}^{m \times l}$ and
$Y \in \mathbb{R}^{m}$, we want to choose $\theta$ that solves
$\min_{\theta \in \mathbb{R}^l} \|X\theta - Y\|_2^2$.
Assuming invertibility of $X^T X$, it is well known that the
solution is given by

$$\theta = (X^T X)^{-1} X^T Y. \qquad (5)$$

Now, suppose that we have time indexed data
$\{X_s, Y_s\}_{s=0}^{t}$ with $X_s \in \mathbb{R}^{m_s \times l}$ and
$Y_s \in \mathbb{R}^{m_s}$. In order to update $\theta_t$ given
$\{X_{t+1}, Y_{t+1}\}$, first we must check how well $\theta_t$ fits
the newly augmented data set. We do this by evaluating

$$\sum_{s=0}^{t+1} \|X_s \theta - Y_s\|_2^2 \qquad (6)$$

with $\theta = \theta_t$. Depending on the result, we choose a
parameter $\lambda \in [0, 1]$ that determines how much weight
to give the previous data when computing $\theta_{t+1}$. That
is, $\lambda$ represents how much we would like to "forget" the
previous data (or emphasize the new data), with a value of
$\lambda = 1$ indicating that all previous data has been thrown
out. Similarly, the case $\lambda = \frac{1}{2}$ corresponds to the
case when past and present are weighed equally, and the case
$\lambda = 0$ corresponds to the case when $\theta_t$ fits the new
data perfectly (i.e. (6) is equal to zero).



Let $X_{[0,t]}$ be the $\left(\sum_{s=0}^{t} m_s\right) \times l$ matrix
$[X_0^T, X_1^T, \ldots, X_t^T]^T$ obtained by concatenation,
and similarly define the length-$\sum_{s=0}^{t} m_s$ vector
$Y_{[0,t]} := [Y_0^T, Y_1^T, \ldots, Y_t^T]^T$. Then (6) is equivalent to
$\|X_{[0,t+1]} \theta - Y_{[0,t+1]}\|_2^2$. Now, when using a particular
second order Newton method for minimizing a smooth
convex function, the computation of the inverse Hessian
matrix is analogous to computing $(X_{[0,t]}^T X_{[0,t]})^{-1}$
above. As $t$ grows large, the cumulative normal matrix
$X_{[0,t]}^T X_{[0,t]}$ becomes increasingly costly to compute
from scratch, as does its inverse. Fortunately, we observe
that

$$X_{[0,t+1]}^T X_{[0,t+1]} = X_{[0,t]}^T X_{[0,t]} + X_{t+1}^T X_{t+1}. \qquad (7)$$

However, if we want to incorporate the flexibility to
weigh current data differently relative to previous data,
we need to abandon the exact computation of (7). Instead,
letting $A_t$ denote the approximate analogue of
$X_{[0,t]}^T X_{[0,t]}$, we introduce the update

$$A_{t+1} = \frac{2}{1 + \mu} \left( \mu A_t + X_{t+1}^T X_{t+1} \right), \qquad (8)$$

where $\mu$ satisfies $\lambda = \frac{1}{1 + \mu}$.



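The actual update of $\theta_t$ is performed as follows. Define

$$\tilde{Y}_{[0,t]} := X_{[0,t]}^T Y_{[0,t]} = [X_0^T, \ldots, X_t^T] \begin{bmatrix} Y_0 \\ \vdots \\ Y_t \end{bmatrix} = \sum_{s=0}^{t} X_s^T Y_s.$$

Up to time $t$, the standard solution to the least squares
problem on the data $\{X_{[0,t]}, Y_{[0,t]}\}$ is then

$$\theta_t = (X_{[0,t]}^T X_{[0,t]})^{-1} \tilde{Y}_{[0,t]}. \qquad (9)$$

Now let $B_t$ be an approximation to $\tilde{Y}_{[0,t]}$. We define
$B_{t+1}$ by the update

$$B_{t+1} = \frac{2}{1 + \mu} \left( \mu B_t + X_{t+1}^T Y_{t+1} \right).$$

Finally, we set

$$\theta_{t+1} = A_{t+1}^{-1} B_{t+1}. \qquad (10)$$

It is easily verified that (10) coincides with the standard
update (9) when $\mu = 1$.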
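The weighted recursion can be checked numerically with a small pure-Python sketch. It assumes the updates take the form $A_{t+1} = \frac{2}{1+\mu}(\mu A_t + X_{t+1}^T X_{t+1})$ and $B_{t+1} = \frac{2}{1+\mu}(\mu B_t + X_{t+1}^T Y_{t+1})$; the helper names are hypothetical, and the dense solver is only suitable for tiny illustrative problems.

```python
def gram(X):
    """X^T X for a matrix X given as a list of rows."""
    n = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]

def cross(X, Y):
    """X^T Y for a matrix X (list of rows) and a vector Y."""
    n = len(X[0])
    return [sum(r[i] * y for r, y in zip(X, Y)) for i in range(n)]

def weighted_update(A, B, X_new, Y_new, mu):
    """Down-weight the accumulated statistics by mu, then add the new batch."""
    c = 2.0 / (1.0 + mu)
    G, h = gram(X_new), cross(X_new, Y_new)
    n = len(B)
    A_next = [[c * (mu * A[i][j] + G[i][j]) for j in range(n)] for i in range(n)]
    B_next = [c * (mu * B[i] + h[i]) for i in range(n)]
    return A_next, B_next

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x
```

With `mu = 1.0` the prefactor equals one and the recursion reduces to the exact accumulation in (7), so `solve(A, B)` returns the ordinary least squares solution on all batches seen so far. With `mu = 0.0` only the newest batch contributes.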

4.2 Statistically Adaptive Learning 

Returning to our original problem, we start with the
parameter $\theta_0$ obtained from some initial pass through
$\{X_0, Y_0\}$, typically using a particular gradient descent
algorithm. In what follows, we will need to define an easily
evaluated error function to be applied at each iteration,
mildly related to the cumulative regret (3):



lit, 



yi 



(11) 



We remark that $I(t, \theta)$ represents the relative number of
incorrect predictions associated with the parameter $\theta$ over
all data points from time $s = 0$ to $s = t$. Moreover,
because $p_\theta$ is a linear function of $x$, $I$ is very fast to
evaluate (essentially $O(m)$, where $m = \sum_{s=0}^{t} m_s$).

Given $\theta_t$, we compute $I(t + 1, \theta_t)$. There are two
extremal possibilities:

1. $I(t + 1, \theta_t)$ is significantly larger than $I(t, \theta_t)$. More
precisely, we mean that $I(t + 1, \theta_t) - I(t, \theta_t) > \sigma(t)$,
where $\sigma(t)$ is the standard deviation of
$\{I(s, \theta_s)\}_{s=0}^{t}$. In this case the data has significantly
changed, and so $\theta$ must be modified.

2. Otherwise, there is no need to change $\theta$ and we set
$\theta_{t+1} = \theta_t$.

In the former case, we use the magnitude of
$I(t + 1, \theta_t) - I(t, \theta_t)$ to determine a subsample of the
old and new data with $M_{\mathrm{old}}$ and $M_{\mathrm{new}}$ points
chosen, respectively (see Figure 1). Roughly speaking, the larger
the difference the more weight will be given to the most recent
data points. The sampling of previous data points serves to anchor
the model so that the parameters do not overfit to the
new batch at the expense of significantly increasing the
global regret. This is a generalization of online learning
methods where only the most recent single data point
is used to update $\theta$. From the subsample chosen, we
then apply a gradient descent optimization routine where
the initialization of the associated starting parameters is
generated from those stored from the previous iteration.
In the case of L-BFGS, the rank 1 matrices used to
approximate the inverse Hessian stored from the previous
iteration are used to initialize the new descent routine.
We summarize the process in Algorithm 1.

[Figure 1 panels: batches 1, 2, ..., t-1 (data points selected
from previous batches); batch t (data points selected from
current batch).]

Figure 1: Subsampling of the partitioned data stream at
time t and times 0, 1, ..., t - 1, respectively.

Algorithm 1 Statistically Adaptive Learning Method (SA L-BFGS)

Require: Error checking function $I(t, \theta)$
Given data $\{X_t, Y_t\}_{t=0}^{t_f}$
Run gradient descent optimization on $\{X_0, Y_0\}$ to compute $\theta_0$
for t = 1 to $t_f$ do
    if $I(t + 1, \theta_t) - I(t, \theta_t) > \sigma(t)$ then
        Choose $M_{\mathrm{old}}$ and $M_{\mathrm{new}}$
        Subsample data
        Run L-BFGS with initial parameter $\theta_t$ to obtain $\theta_{t+1}$
    else
        $\theta_{t+1} \leftarrow \theta_t$
    end if
end for
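The control flow of Algorithm 1 can be sketched in Python under several loudly stated assumptions: plain gradient descent on the logistic loss stands in for L-BFGS, predictions are thresholded at a score of zero, the sample sizes `m_old` and `m_new` are fixed inputs rather than derived from the size of the error jump, and all function names are hypothetical.

```python
import math
import random
import statistics

def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def error_rate(theta, batches):
    """Fraction of incorrect predictions over all points in the given batches."""
    points = [p for batch in batches for p in batch]
    wrong = sum(1 for x, y in points if (1 if dot(theta, x) >= 0 else 0) != y)
    return wrong / len(points)

def fit(theta, sample, steps=200, eta=0.5):
    """Gradient descent on the mean logistic loss (a stand-in for L-BFGS)."""
    theta = list(theta)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in sample:
            residual = sigmoid(dot(theta, x)) - y
            for j, xj in enumerate(x):
                grad[j] += residual * xj / len(sample)
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta

def sa_learn(batches, dim, m_old, m_new, seed=0):
    """Refit on a subsample only when the error jump exceeds the error history's spread."""
    rng = random.Random(seed)
    theta = fit([0.0] * dim, batches[0])
    history = [error_rate(theta, batches[:1])]        # I(s, theta_s) for s <= t
    refits = 0
    for t in range(len(batches) - 1):
        err_next = error_rate(theta, batches[:t + 2])  # I(t + 1, theta_t)
        sigma = statistics.pstdev(history)
        if err_next - history[-1] > sigma:
            old = [p for batch in batches[:t + 1] for p in batch]
            new = batches[t + 1]
            sample = rng.sample(old, min(m_old, len(old)))
            sample += rng.sample(new, min(m_new, len(new)))
            theta = fit(theta, sample)                 # warm start from theta_t
            refits += 1
        history.append(error_rate(theta, batches[:t + 2]))
    return theta, refits
```

When a new batch looks like the old data, the parameters are carried over untouched; when the labels shift, a refit is triggered on a mix of old and new points.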

As a typical example, at some time t we might have 
A/old = 1000, Mnow = 100, 000, and ^*io rut = I-IO^. 
This would be indicative of a batch {Xt+i,Yt+i} with 
significantly higher error using the current parameter 9t 
than for previously analyzed batches. 

We remark that when learning on each new batch of
data, there are two main aspects that can be parallelized.
First, the batch itself can be partitioned and distributed
among nodes in a cluster via AllReduce to significantly
speed up the evaluation of the cost function and its
gradient as is done in (Agarwal et al., 2012). Furthermore,
one can run multiple independent optimization routines
in parallel where the distribution used to subsample from
$X_{t+1}$ and $\cup_{s=0}^{t} X_s$ is varied. The resulting parameters
$\theta$ obtained from each separate instance can then be
statistically compared so as to make sure that the model is
not overly sensitive to the choice of sampling distribution.
Otherwise, having $\theta$ be too highly dependent on the
choice of subsample would invalidate using a stochastic
gradient descent-based approach. A by-product of
this ability to simultaneously experiment with different
samplings is that it provides a quick means to check the
consistency of the data.

Finally, we remark that the SA L-BFGS method can be
reasonably adapted to account for changes in the selected
features as new data is acquired. Indeed, it is very
appealing within the industry to be able to experiment with
different choices of features in order to find those that
matter most, while still being able to use the previously
computed parameters $\theta_t$ to speed up the optimization on
the new data. Of course, it is possible to directly apply
an online learning approach in this situation, since
previous data points have already been discarded. But
typical gradient descent algorithms do not a priori have
the flexibility to be directly applied in such cases and they
typically perform worse than batch methods such as
L-BFGS (Agarwal et al., 2012).



5 Experiments 

5.1 Description of Dataset and Features 

We consider data used to predict the click-through-rate 
(pCTR) of online ads. An accurate model is necessary 
in the search advertising market in order to appropriately 
rank ads and price clicks. The data contains 11 variables 
and 1 output, corresponding to the number of times a 
given ad was clicked by the user among the number of 
times it was displayed. In order to reduce the data size, 
instances with the same user id, ad id, query, and setting 
are combined, so that the output may take on any posi- 
tive integer value. For each instance (training example), 
the input variables serve to classify various properties of 
the ad displayed, in addition to the specific search query 
entered. This data was acquired from sessions of the Tencent
proprietary search engine and was posted publicly on
www.kddcup2012.org (Tencent, 2012).

For these experiments we build a basic model that 
learns from the identifiers provided in the training set. 
These include unique identifiers for each query, ad, key- 
word, advertiser, title, description, display url, user, ad 
position, and ad depth (further details available in the 
KDD documentation). We compute a position and depth 
normalized click through rate for each identifier, as well 
as combinations (conjunctions) of these identifiers. Then 
at training and testing time we annotate each example 
with these normalized click through rates. Additionally, 
before running the optimization, it is necessary to build
appropriate feature vectors (i.e. shape the data). We will
not go into detail regarding how this is done, except to
mention that the number of features generated is on the
order of 1000.

5.2 Model 1 Results 

For our first set of experiments, we compare the performance
of Vowpal Wabbit (VW) using its L-BFGS implementation and the
Context Relevant Flexible Analytics and Statistics Technology™
L-BFGS implementation running on 10 Amazon m1.xlarge instances.
The time measured (in seconds) is only the time required to
train the models. The features are generated and cached for
each implementation prior to training.

Performance was measured using the Area Under
Curve (AUC) metric because this was the methodology
used in (Tencent, 2012). In short, the AUC is equal to the
probability that a classifier will rank a randomly chosen
positive instance higher than a randomly chosen negative
one. More precisely, it is computed via Algorithm 3 in
(Fawcett, 2004). We compute our AUC results over a
portion of the public section of the test set that has about
2 million examples.
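The probabilistic reading of the AUC translates directly into a few lines of Python. This sketch compares every positive-negative pair, counting ties as one half; it is quadratic in the number of examples and is meant only to illustrate the definition, not to reproduce the sorting-based procedure of (Fawcett, 2004).

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

For example, `auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` evaluates to 0.75, since three of the four positive-negative pairs are ranked correctly.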

Model 1 includes only the basic id features, with no
conjunction features, and achieves an AUC of 0.748 as
shown in Table 1. A simple baseline, which can be generated
by predicting the mean CTR for each ad id, would perform at
an AUC of approximately 0.71. The winner of (Tencent, 2012)
performed at an AUC of 0.80. However, the winning model was
substantially more complicated and used many additional
features that were excluded from this simple demonstration.
In future work, we will explore more sophisticated feature sets.

The Context Relevant and VW models both achieve the 
same AUC on the test set, which validates that the basic 
gradient descent and L-BFGS implementations are func- 
tionally equivalent. The Context Relevant model com- 
pletes learning between four and five times more quickly. 
Our implementation is heavily optimized to reduce com- 
putation time as well as memory footprint. In addition, 
our implementation utilizes an underlying MapReduce 
implementation that provides robustness to job and node 
failure.¹

5.3 Model 2 Results 

Context Relevant's implementation of SA L-BFGS is
designed to accelerate and simplify learning iterative
changes to models. Using information gleaned from the
initial L-BFGS pass, SA L-BFGS develops a sampling
strategy to minimize sampling induced noise when learning
new models that are derived from previous models.
The larger the divergence in the models, the less speed-up
is likely. For this set of experiments, we add a conjunction
feature that captures the interaction between a
query id and an ad id, which has frequently been an
important feature in well known advertising systems. We
then compare the speed and accuracy of Vowpal Wabbit
(VW) using its L-BFGS implementation, the Context
Relevant Flexible Analytics and Statistics Technology™
L-BFGS implementation, and the Context Relevant Flexible
Analytics and Statistics Technology™ SA L-BFGS
implementation (SA) running on 10 Amazon m1.xlarge
instances. Here the baseline L-BFGS models are trained
with the standard practice for adding a new feature: the
models are retrained on the entire dataset. The time
measured (in seconds) is only the time required to train the
models. The features are generated and cached for each
implementation prior to training.

¹ Context Relevant had to re-write the AllReduce network
implementation to add error checking so that the AllReduce
system was robust to errors that were encountered during normal
execution of experiments on Amazon's EC2 systems. Without
these changes, we could not keep AllReduce from hanging during
the experiments. There is no graceful recovery from the loss
of a single node.

Table 1: Model 1 Results For Different Learning Mechanisms
(VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS)

            VW          CR
            L-BFGS      L-BFGS
seconds     490         114
AUC         .748        .748

Table 2: Model 2 Results For Different Learning Mechanisms
(VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS;
SA = Context Relevant FAST SA L-BFGS)

            VW          CR          CR
            L-BFGS      L-BFGS      SA L-BFGS
seconds     515         115         9
AUC         .750        .752        .751

Again, performance was measured using the Area Under
Curve (AUC) metric. Table 2 lists the results for each
algorithm, and shows that AUC improves in comparison 
to Model 1 when the new feature is added. As with Model 
1, the VW and CR models achieve similar AUC and the 
basic L-BFGS CR learning time is significantly faster. 
Furthermore, we show that SA L-BFGS also achieves 
similar AUC, but in less than one tenth of the time (which 
likewise implies one tenth of the compute cost required). 
This speed up can enable a large increase in the number 
of experiments, without requiring additional compute or 
time. It is important to note that the speed of L-BFGS 
and SA L-BFGS is essentially tied to the number of 
features and the number of examples for each iteration. 
The primary performance gains that can be found are: (a) 
reducing the number of iterations; (b) reducing the num- 
ber of examples; or (c) reducing the number of features 
with non-zero weights. SA L-BFGS adopts the first
two strategies. A reduction in the number of features 
with non-zero weights can be forced through aggressive 
regularization, but at the expense of specificity. 

6 Conclusions 

We have presented a new tera-scale machine learning
system that enables rapid model experimentation. The
system uses a new version of L-BFGS to combine the
robustness and accuracy of second order gradient descent
optimization methods with the memory advantages of
online learning. This provides a model building environment
that significantly lowers the time and compute
cost of asking new questions. The ability to quickly ask
and answer experimental questions vastly expands the set
of questions that can be asked, and therefore the space
of models that can be explored to discover the optimal
solution.

SA L-BFGS is also well suited to environments where
the underlying distribution of the data provided to a
learning algorithm is shifting. Whether the shift is caused
by changes in user behavior, changes in market pricing,
or changes in term usage, SA L-BFGS can be empirically
tuned to dynamically adjust to the changing conditions.
One can utilize the parallelized approach in
(Agarwal et al., 2012) on each batch of data in the time
variable, with the additional freedom to choose the batch
size. Furthermore, the statistical aspects of the algorithm
provide a useful way to check the consistency of the data
in real time. However, like all L-BFGS implementations
that rely on small, reduced, or sampled data sets,
increased sampling noise from the L-BFGS estimation
process affects the quality of the resulting learning
algorithm. Users of this new algorithm must take care to
provide smooth convex loss functions for optimization.
The Context Relevant Flexible Analytics and Statistics
Technology™ is designed to provide such functions for
optimization.
vision broadcast industry. He is recognized as one of
the world's leading network performance experts and has 
been granted twenty two technology patents that range 
from performance to user interface design. Jim earned a 
bachelor of science degree in computing science from the 
University of Alberta. 

Dustin Rigg Hillard - Director of Engineering and 
Data Scientist - Dustin is a recognized data science and 
machine learning expert who has published more than 30 
papers in these areas. Previously at Microsoft and Ya- 
hoo!, he spent the last decade building systems that sig- 
nificantly improve large-scale processing and machine- 
learning for advertising, natural language and speech. 

Prior to joining Context Relevant, Dustin Hillard
was at Microsoft, where he worked to improve
speech understanding for mobile applications and XBox
Kinect. Before that he was at Yahoo! for three 
years, where he focused on improving ad relevance. 
His research in graduate school focused on automatic 
speech recognition and statistical translation. Dustin 
incorporates approaches from these and other fields to 
learn from massive amounts of data with supervised, 
semi-supervised, and unsupervised machine learning ap- 
proaches. 

Dustin holds a bachelor's and master's degree as well 
as a PhD in Electrical Engineering from the University of 
Washington. 

Scott Golder - Data Scientist & Staff Sociologist - 
Scott mines social networking data to investigate broad 
questions such as when people are happiest (mornings 
and weekends) and how Twitter users form new social 
ties. His work has been published in the journal Science 
as well as top computer science conferences by the ACM 
and IEEE, and has been covered in media outlets such 



as MSNBC, The New York Times, The Washington Post 
and National Public Radio. He has also been profiled by 
LiveScience's "ScienceLives". 

He has worked as a research scientist in the Social 
Computing Lab at HP Labs and has interned at Google, 
IBM and Microsoft. Scott holds a master's degree from 
MIT, where he worked with the Media Laboratory's So- 
ciable Media Group, and graduated from Harvard Univer- 
sity, where he studied Linguistics and Computer Science. 
Scott is currently on leave from the PhD program in So- 
ciology at Cornell University. 

Mark Hubenthal - Member of the Technical Staff - 
Mark recently received his PhD in Mathematics from the 
University of Washington. He works on inverse problems 
applicable to medical imaging and geophysics. 

Scott Smith - Principal Engineer and Architect - Scott 
has experience with distributed computing, compiler de- 
sign, and performance optimization. At Akamai, he 
helped design and implement the load balancing and 
failover logic for the first CDN. At Clustrix, he built the 
SQL optimizer and compiler for a distributed RDBMS. 
He holds a master's degree in computer science from MIT.